Raúl van Harmelen: 14665387
Elianto Dijk: 14434091
Fardeen Wieliams: 13770160
Sharynho Markelo: 14579073
import csv
import networkx as nx
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import preprocessing
import statsmodels
Have you ever wondered how people interact with each other? How do these interactions differ online and offline? These types of questions concern network science. A network is a web that consists of nodes and edges. People, devices, accounts or other objects can be represented as nodes and the edges can be represented as the relationships between these nodes. All these nodes and edges together form a network. Network science studies the relationships between these nodes. In this project, different social networks will be analysed. The following two perspectives in this project are:
In order to analyse both perspectives, different network properties are calculated. Leaping into the future, we were able to successfully compute all these different network properties. From network science it appears that networks emerged from real social interactions have different characteristics compared to random networks. A distinction must be made between random networks and real networks. According to Barabasi (2016), a random network consists of N nodes where each node pair is connected with probability p. Because a random network is constructed by chance, the characteristics are typical for this randomness. Barabasi (2016) also explains that real networks have certain characteristics since that cannot be explained by random network models. The reason for this is that real networks are not constructed by chance, but are a result of social interactions which cannot be reacreated by computers.
If online social interactions form networks similar to real networks, then online social interactions are similar to real social interactions. If not, it means social interactions are different from real interactions.
This project is based on three different datasets of social networks that can be found on Kaggle (Appendix). Each dataset is a representation of a network in the form of a list of edges. To some datasets there is also a file added with node attributes. The latest dataset is slightly different from the first two, because it is a dataset containing multiple datasets of social networks.
Before the datasets can be read in, it is necessary to manually select which network datasets will and will not be selected for the project. This selection is based on the size of the datasets to reduce the risk of very long processing times. Besides, any not connected network is excluded from the project to simplify calculation tasks. The table below is an overview of all the different networks, retrieved from the datasets, that will be used in this project.
| Dataset | Node representation | Link representation | Node attributes if applicable |
|---|---|---|---|
| NashvilleMeetupNetwork | Member of a Meetup group | Shared group membership in 'weight' groups | N/A |
| DeezerHR | Deezer users from Croatia | The relation friendship | Genre preferences of each user |
| DeezerHU | Deezer users from Hungary | The relation friendship | Genre preferences of each user |
| DeezerRO | Deezer users from Romania | The relation friendship | Genre preferences of each user |
| FacebookLargePage | Official Facebook pages | The amount of mutual likes between pages | Descriptions of the purpose of the site |
| FeatherDeezerSocial | Deezer users from Europe | The relation friendship | Artists liked by the users |
| FeatherLastfmSocial | Lastfm users from Asia | The relation friendship | Artists liked by the users |
| GithubSocial | Developers who have starred at least 10 repositories | Mutual follwing between developers | Location, Repositories starred, Employer and E-mail address |
| TwitchSocialNetworks | Twitch users | The relation friendship | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksDE | Twitch users from Germany | The relation friendship | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksENGB | Twitch users from England | The relation friendship | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksES | Twitch users from Spain | The relation friendship | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksFR | Twitch users from France | The relation friendship | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksPTBR | Twitch users from Brazil | The relation friendship | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksRU | Twitch users from Russia | The relation friendship | Games played and liked, Location and Streaming habits |
| SocEpinions1 | Epinion users | Trust relationship between users | N/A |
| SocSignSlashdot090221 | Slashdot users 2009 | Friend or enemy relationship | N/A |
| SocSlashdot0902 | Slashdot users 2009 | Friend or enemy relationship | N/A |
Some datasets of networks have their edges saved in a .txt file while others in a .csv file. In order to read the edges with the pandas read_csv function those .txt files need to be converted to .csv files. A function is used to accomplish this that can be found in the preprocessing.py file.
# preprocessing.transform_text_to_csv('datasets/soc-sign-Slashdot090221.txt', 'datasets/soc-sign-Slashdot090221.csv')
# preprocessing.transform_text_to_csv('datasets/soc-Slashdot0902.txt', 'datasets/soc-Slashdot0902.csv')
The datasets can be easily read in with pandas.
NashvilleMeetupNetwork_member_edges = pd.read_csv('datasets/NashvilleMeetupNetwork/member-edges.csv', index_col=0)
DeezerHR_edges = pd.read_csv('datasets/DeezerSocialNetworks/HR/HR_edges.csv')
DeezerHU_edges = pd.read_csv('datasets/DeezerSocialNetworks/HU/HU_edges.csv')
DeezerRO_edges = pd.read_csv('datasets/DeezerSocialNetworks/RO/RO_edges.csv')
FacebookLargePage_edges = pd.read_csv('datasets/facebook-large-page-page-network/musae_facebook_edges.csv')
FeatherDeezerSocial_edges = pd.read_csv('datasets/feather-deezer-social/deezer_europe_edges.csv')
FeatherLastfmSocial_edges = pd.read_csv('datasets/feather-lastfm-social/lastfm_asia_edges.csv')
GithubSocial_edges = pd.read_csv("datasets/github-social/musae_git_edges.csv")
TwitchSocialNetworksDE_edges = pd.read_csv("datasets/twitch-social-networks/DE/musae_DE_edges.csv")
TwitchSocialNetworksENGB_edges = pd.read_csv("datasets/twitch-social-networks/ENGB/musae_ENGB_edges.csv")
TwitchSocialNetworksES_edges = pd.read_csv("datasets/twitch-social-networks/ES/musae_ES_edges.csv")
TwitchSocialNetworksFR_edges = pd.read_csv("datasets/twitch-social-networks/FR/musae_FR_edges.csv")
TwitchSocialNetworksPTBR_edges = pd.read_csv("datasets/twitch-social-networks/PTBR/musae_PTBR_edges.csv")
TwitchSocialNetworksRU_edges = pd.read_csv("datasets/twitch-social-networks/RU/musae_RU_edges.csv")
SocSignSlashdot090221_edges = pd.read_csv('datasets/soc-sign-Slashdot090221.csv')
SocSlashdot0902_edges = pd.read_csv('datasets/soc-Slashdot0902.csv')
After that, undirected networkx graphs can be constructed from the pandas dataframes.
NashvilleMeetupNetwork = nx.from_pandas_edgelist(NashvilleMeetupNetwork_member_edges, 'member1', 'member2', edge_attr='weight', create_using=nx.Graph)
DeezerHR = nx.from_pandas_edgelist(DeezerHR_edges, 'node_1', 'node_2', create_using=nx.Graph)
DeezerHU = nx.from_pandas_edgelist(DeezerHU_edges, 'node_1', 'node_2', create_using=nx.Graph)
DeezerRO = nx.from_pandas_edgelist(DeezerRO_edges, 'node_1', 'node_2', create_using=nx.Graph)
FacebookLargePage = nx.from_pandas_edgelist(FacebookLargePage_edges, 'id_1', 'id_2', create_using=nx.Graph)
FeatherDeezerSocial = nx.from_pandas_edgelist(FeatherDeezerSocial_edges, 'node_1', 'node_2', create_using=nx.Graph)
FeatherLastfmSocial = nx.from_pandas_edgelist(FeatherLastfmSocial_edges, 'node_1', 'node_2', create_using=nx.Graph)
GithubSocial = nx.from_pandas_edgelist(GithubSocial_edges, 'id_1', 'id_2', create_using=nx.Graph)
TwitchSocialNetworksDE = nx.from_pandas_edgelist(TwitchSocialNetworksDE_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksENGB = nx.from_pandas_edgelist(TwitchSocialNetworksENGB_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksES = nx.from_pandas_edgelist(TwitchSocialNetworksES_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksFR = nx.from_pandas_edgelist(TwitchSocialNetworksFR_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksPTBR = nx.from_pandas_edgelist(TwitchSocialNetworksPTBR_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksRU = nx.from_pandas_edgelist(TwitchSocialNetworksRU_edges, 'from', 'to', create_using=nx.Graph)
SocSignSlashdot090221 = nx.from_pandas_edgelist(SocSignSlashdot090221_edges, '# FromNodeId', 'ToNodeId', edge_attr='Sign' , create_using=nx.Graph)
SocSlashdot0902 = nx.from_pandas_edgelist(SocSlashdot0902_edges, '# FromNodeId', 'ToNodeId', create_using=nx.Graph)
The final part of the preprocessing part consists of calculating different network properties for each network. The visualizations section explains what the different properties mean in more detail.
networks = {
'NashvilleMeetupNetwork' : NashvilleMeetupNetwork,
'DeezerHR' : DeezerHR,
'DeezerHU' : DeezerHU,
'DeezerRO' : DeezerRO,
'FacebookLargepage' : FacebookLargePage,
'FeatherDeezerSocial' : FeatherDeezerSocial,
'FeatherLastfmSocial' : FeatherLastfmSocial,
'GithubSocial' : GithubSocial,
'TwitchSocialNetworksDE' : TwitchSocialNetworksDE,
'TwitchSocialNetworksENGB' : TwitchSocialNetworksENGB,
'TwitchSocialNetworksES' : TwitchSocialNetworksES,
'TwitchSocialNetworksFR' : TwitchSocialNetworksFR,
'TwitchSocialNetworksPTBR' : TwitchSocialNetworksPTBR,
'TwitchSocialNetworksRU' : TwitchSocialNetworksRU,
'SocSignSlashdot090221' : SocSignSlashdot090221,
'SocSlashdot0902' : SocSlashdot0902,
}
The code below uses networkx and will take some time to run. That is why the results are written to a CSV-file that can be read in later.
# start = time.time()
# descriptive_stats = pd.DataFrame(index=list(networks.keys()), columns=['number_of_nodes', 'number_of_edges', 'average_degree', 'density', 'connected_network', 'avg_cc', 'transitivity', 'average_shortest_path_length', 'diameter', 'has_bridges', 'number_of_bridges', 'max_degree'])
# for name, network in networks.items():
# descriptive_stats['number_of_nodes'].loc[name] = nx.number_of_nodes(network)
# descriptive_stats['number_of_edges'].loc[name] = nx.number_of_edges(network)
# descriptive_stats['average_degree'].loc[name] = np.mean(list(dict(network.degree()).values()))
# descriptive_stats['density'].loc[name] = nx.density(network)
# descriptive_stats['connected_network'].loc[name] = nx.is_connected(network)
# descriptive_stats['avg_cc'].loc[name] = nx.average_clustering(network)
# descriptive_stats['transitivity'].loc[name] = nx.transitivity(network)
# descriptive_stats['has_bridges'].loc[name] = nx.has_bridges(network)
# descriptive_stats['number_of_bridges'].loc[name] = len(list(nx.bridges(network)))
# descriptive_stats['max_degree'].loc[name] = max(dict(nx.degree(network)).values())
# descriptive_stats.to_csv('results/descriptive_stats_of_networks_2.csv')
# end = time.time()
# print(f'tijd in seconden:{end - start}')
Networkx can be used to calculate many different network properties as seen above. However, some network properties require more efficient algorithms to compute.
Below this cell all the code can be found that was used to calculated additional network properties of the network. This code was run in a google Colab notebook because this allows us to use Google's GPU for gpu accelerated computing. Instead of pandas and networkx, the python libaries cudf and cugraph were used.
# import cudf
# import cugraph as cnx
# import time
# import numpy as np
# import json
# import itertools
# # Reading in the data with cudf
# NashvilleMeetupNetwork_member_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/NashvilleMeetupNetwork/member-edges.csv', index_col=0)
# DeezerHR_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/DeezerSocialNetworks/HR/HR_edges.csv')
# DeezerHU_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/DeezerSocialNetworks/HU/HU_edges.csv')
# DeezerRO_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/DeezerSocialNetworks/RO/RO_edges.csv')
# FacebookLargePage_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/facebook-large-page-page-network/musae_facebook_edges.csv')
# FeatherDeezerSocial_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/feather-deezer-social/deezer_europe_edges.csv')
# FeatherLastfmSocial_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/feather-lastfm-social/lastfm_asia_edges.csv')
# GithubSocial_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/github-social/musae_git_edges.csv")
# TwitchSocialNetworksDE_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/DE/musae_DE_edges.csv")
# TwitchSocialNetworksENGB_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/ENGB/musae_ENGB_edges.csv")
# TwitchSocialNetworksES_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/ES/musae_ES_edges.csv")
# TwitchSocialNetworksFR_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/FR/musae_FR_edges.csv")
# TwitchSocialNetworksPTBR_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/PTBR/musae_PTBR_edges.csv")
# TwitchSocialNetworksRU_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/RU/musae_RU_edges.csv")
# SocSignSlashdot090221_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/soc-sign-Slashdot090221.csv')
# SocSlashdot0902_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/soc-Slashdot0902.csv')
# # Constructing networks with cugraph
# NashvilleMeetupNetwork = cnx.from_cudf_edgelist(NashvilleMeetupNetwork_member_edges, 'member1', 'member2', edge_attr='weight', create_using=cnx.Graph)
# DeezerHR = cnx.from_cudf_edgelist(DeezerHR_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# DeezerHU = cnx.from_cudf_edgelist(DeezerHU_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# DeezerRO = cnx.from_cudf_edgelist(DeezerRO_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# FacebookLargePage = cnx.from_cudf_edgelist(FacebookLargePage_edges, 'id_1', 'id_2', create_using=cnx.Graph)
# FeatherDeezerSocial = cnx.from_cudf_edgelist(FeatherDeezerSocial_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# FeatherLastfmSocial = cnx.from_cudf_edgelist(FeatherLastfmSocial_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# GithubSocial = cnx.from_cudf_edgelist(GithubSocial_edges, 'id_1', 'id_2', create_using=cnx.Graph)
# TwitchSocialNetworksDE = cnx.from_cudf_edgelist(TwitchSocialNetworksDE_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksENGB = cnx.from_cudf_edgelist(TwitchSocialNetworksENGB_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksES = cnx.from_cudf_edgelist(TwitchSocialNetworksES_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksFR = cnx.from_cudf_edgelist(TwitchSocialNetworksFR_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksPTBR = cnx.from_cudf_edgelist(TwitchSocialNetworksPTBR_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksRU = cnx.from_cudf_edgelist(TwitchSocialNetworksRU_edges, 'from', 'to', create_using=cnx.Graph)
# SocSignSlashdot090221 = cnx.from_cudf_edgelist(SocSignSlashdot090221_edges, '# FromNodeId', 'ToNodeId', edge_attr='Sign' , create_using=cnx.Graph)
# SocSlashdot0902 = cnx.from_cudf_edgelist(SocSlashdot0902_edges, '# FromNodeId', 'ToNodeId', create_using=cnx.Graph)
# networks = {
# 'NashvilleMeetupNetwork' : NashvilleMeetupNetwork,
# 'DeezerHR' : DeezerHR,
# 'DeezerHU' : DeezerHU,
# 'DeezerRO' : DeezerRO,
# 'FacebookLargepage' : FacebookLargePage,
# 'FeatherDeezerSocial' : FeatherDeezerSocial,
# 'FeatherLastfmSocial' : FeatherLastfmSocial,
# 'GithubSocial' : GithubSocial,
# 'TwitchSocialNetworksDE' : TwitchSocialNetworksDE,
# 'TwitchSocialNetworksENGB' : TwitchSocialNetworksENGB,
# 'TwitchSocialNetworksES' : TwitchSocialNetworksES,
# 'TwitchSocialNetworksFR' : TwitchSocialNetworksFR,
# 'TwitchSocialNetworksPTBR' : TwitchSocialNetworksPTBR,
# 'TwitchSocialNetworksRU' : TwitchSocialNetworksRU,
# 'SocSignSlashdot090221' : SocSignSlashdot090221,
# 'SocSlashdot0902' : SocSlashdot0902,
# }
# # Code used to compute the average shortest path length and diameter
# start = time.time()
# descriptive_stats = cudf.DataFrame(index=list(networks.keys()), columns=['average_shortest_path_length', 'diameter'])
# for name, network in networks.items():
# shortest_path_lengths = [cnx.bfs_edges(network, source=x)['distance'].max() for x in network.nodes().to_numpy()]
# descriptive_stats['average_shortest_path_length'].loc[name] = sum(shortest_path_lengths)/len(shortest_path_lengths)
# descriptive_stats['diameter'].loc[name] = max(shortest_path_lengths)
# descriptive_stats.to_csv('/content/drive/MyDrive/Colab Notebooks/results/descriptive_stats_of_networks.csv')
# end = time.time()
# print(f'tijd in seconden:{end - start}')
Combining the calculatings from networkx and cugraph gives the following dataframe:
descriptive_stats = pd.read_csv('results/descriptive_stats_of_networks.csv', index_col=0)
descriptive_stats
| number_of_nodes | number_of_edges | average_degree | density | connected_network | avg_cc | transitivity | average_shortest_path_length | diameter | has_bridges | number_of_bridges | max_degree | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NashvilleMeetupNetwork | 11372 | 1176368 | 206.888498 | 0.018194 | True | 0.884957 | 0.604407 | 4.413648 | 6 | True | 13 | 2356 |
| DeezerHR | 54573 | 498202 | 18.258186 | 0.000335 | True | 0.136477 | 0.114630 | 8.501347 | 12 | True | 2435 | 420 |
| DeezerHU | 47538 | 222887 | 9.377214 | 0.000197 | True | 0.116187 | 0.092924 | 9.930603 | 14 | True | 2869 | 112 |
| DeezerRO | 41773 | 125826 | 6.024274 | 0.000144 | True | 0.091212 | 0.075267 | 12.970340 | 19 | True | 6403 | 112 |
| FacebookLargepage | 22470 | 171002 | 15.220472 | 0.000677 | True | 0.359738 | 0.232321 | 10.443525 | 15 | True | 2973 | 709 |
| FeatherDeezerSocial | 28281 | 92752 | 6.559315 | 0.000232 | True | 0.141160 | 0.095922 | 15.138963 | 21 | True | 6470 | 172 |
| FeatherLastfmSocial | 7624 | 27806 | 7.294334 | 0.000957 | True | 0.219418 | 0.178623 | 10.457240 | 15 | True | 1929 | 216 |
| GithubSocial | 37700 | 289003 | 15.331724 | 0.000407 | True | 0.167537 | 0.012357 | 7.228674 | 11 | True | 5241 | 9458 |
| TwitchSocialNetworksDE | 9498 | 153138 | 32.246368 | 0.003395 | True | 0.200886 | 0.046471 | 5.085176 | 7 | True | 453 | 4259 |
| TwitchSocialNetworksENGB | 7126 | 35324 | 9.914117 | 0.001391 | True | 0.130928 | 0.042433 | 7.001684 | 10 | True | 1222 | 720 |
| TwitchSocialNetworksES | 4648 | 59382 | 25.551635 | 0.005499 | True | 0.222496 | 0.084235 | 6.649312 | 9 | True | 301 | 1022 |
| TwitchSocialNetworksFR | 6549 | 112666 | 34.407085 | 0.005255 | True | 0.221706 | 0.054128 | 5.089327 | 7 | True | 257 | 2040 |
| TwitchSocialNetworksPTBR | 1912 | 31299 | 32.739540 | 0.017132 | True | 0.319895 | 0.130981 | 4.905335 | 7 | True | 116 | 767 |
| TwitchSocialNetworksRU | 4385 | 37304 | 17.014367 | 0.003881 | True | 0.165797 | 0.048648 | 6.288940 | 9 | True | 461 | 1229 |
| SocSignSlashdot090221 | 82140 | 500481 | 12.186048 | 0.000148 | True | 0.058787 | 0.023618 | 8.911346 | 13 | True | 30086 | 2548 |
| SocSlashdot0902 | 82168 | 582533 | 14.179072 | 0.000173 | True | 0.060345 | 0.024109 | 8.910720 | 13 | True | 30035 | 2554 |
Degree distribution of networks.
Firstly, we will look at the amount of acquaintances of a person in a network. The degree is the number of connections (or neighbors) a node has. A node with a high degree has connections with many other nodes and is often refferd to as a hub. A node with a low degree has links with only a few other nodes.
The degree distribution gives insight into the probability that a randomly selected node in the network has a certain degree. It is known that a random network has a binomial degree distribution (Barabasi, 2016). A real network, on the other hand has a degree distribution close to exponetional decline. By analysing the degree distribution of the different networks, we can take the first steps into categorizing the networks into real or random networks. If the distribution of a network is binomial, then the network is better described by the random network model. If not, it's more likely to be a real network.
fig = go.Figure()
#Add traces, one for each slider step
for step in np.arange(0.1, 1, 0.1):
random_network = nx.gnp_random_graph(1500, step)
degrees = [degree for _, degree in random_network.degree()]
degree_counts = nx.degree_histogram(random_network)
fraction_of_nodes = [count / len(random_network) for count in degree_counts]
fig.add_trace(go.Histogram(x=degrees, y=fraction_of_nodes, visible=False))
# Make 0th trace visible
fig.data[0].visible = True
# Create and add slider
steps = []
for i in range(len(fig.data)):
step = dict(
method="update",
args=[{"visible": [False] * len(fig.data)},
{"title": "Degree Distribution of a random network with probability: " + f"0.{i+1} (Figure 1)"}], # layout attribute
)
step["args"][0]["visible"][i] = True # Toggle i'th trace to "visible"
steps.append(step)
sliders = [dict(
active=10,
currentvalue={"prefix": "Probability: "},
pad={"t": 50},
steps=steps
)]
fig.update_layout(
title='Degree Distribution of a random network with probability: 0.1 (Figure 1)',
xaxis_title='Degree',
yaxis_title='Number of nodes with degree',
bargap = 0.01,
sliders=sliders
)
fig.show()
As can be seen above, a randomly generated network has a binomial structure, which means that most nodes have the same number of links and there are no or a few highly connected nodes. This is common for random networks. Next up we'll look at the networks that we're using and see if they're a random network with a binomial structure, or a real network with an exponential decline.
degree_dist = go.Figure()
for network_name, network in networks.items():
degree_sequence = sorted((d for n, d in network.degree()), reverse=True)
num_nodes = len(degree_sequence)
fraction_nodes = np.arange(num_nodes) / num_nodes
degree_dist.add_trace(go.Scatter(x=degree_sequence,
y=fraction_nodes,
mode='lines',
name=network_name))
degree_dist.update_layout(
title="Degree distribution of all networks (Figure 2)",
xaxis_title="Degree",
yaxis_title="Fraction of Nodes",
xaxis_type="log",
)
degree_dist.show()
As the above chart shows, the degree distribution of all the networks are far from binomial degree distribution of random networks. We can see there are many nodes with only a few links and a few hubs with many links. The x-axis is even made logarithmic to account for the exponential distribution of the degree. Thus, considering the degree distribution, all observed networks are most likely real networks. By clicking on the names in the index, you can eliminate certain networks from the graph.
Only few steps needed to reach any arbitrary node in the network.
Another property of real networks is that they tend to be small. What this means becomes clear after introducting the concept distance.
If it is possible to follow links to go from one node to another node in the network, it could be said that there is a path between two nodes. The path length is defined as the number of links in a path. Between the same nodes, there can be multiple paths. Therefore, the path with the least number of links is called the shortest path. With this definition, the distance between two nodes can be defined as the length of the shortest path between nodes, also called the shortest path length.
In order to investigated how close or far we expect nodes to be in a network network scientists look at the distances in a network. Subsequently, if distances are short in a network, a network is considered small. The average shortest path length, which is calculated by averaging the shortest path lengths across all pairs of nodes, can give insight into this.
$a =\sum_{\substack{s,t \in V \\ s\neq t}} \frac{d(s, t)}{n(n-1)}$
where V is the set of nodes in G,
d(s, t) is the shortest path from s to t,
and n is the number of nodes in G.
Another network property related to shortest paths, is the diameter, defined as the maximum shortest path length across all pairs of nodes.
Small world networks
Barabasi (2016) mentions another type of network introduced by Watts and Strogatz in 1998. A small world networks is a real networks with very short path lengths between nodes in a network. In other words, two individuals anywhere in the world can be connected through a few acquaintances. According to Menczer et al. (2020), the average path lengths of small world networks scale logarithmically with the size of the network.
$a \approx log(n)$
where n is the number of nodes in G.
swp = px.scatter(descriptive_stats,
x="number_of_nodes",
y="average_shortest_path_length",
size = "diameter",
trendline= "ols",
trendline_options=dict(log_x=True),
color = "diameter",
title="Relation between average path length and number of nodes (Figure 3)",
hover_name = descriptive_stats.index)
swp.update_layout(
xaxis = dict(
title = 'Amount of nodes'),
yaxis = dict(
title = 'Average (shortest) path length')
)
swp.show()
According to Barabasi (2016), the small-world property is the only property that can be explained by the random network model. Thus, with this plot, there is no way to clearly distinguish whether the networks are real or random. However, the takeaways are firstly that for some networks the average path length indeed scales logarithmically with the size of the network. Secondly, there are only a handful of networks with a short average path lengths compared to the number of nodes in that network. Unsurprisingly these are also the networks with the shortest diameter. Thirdly, this tells us something about the way people interact with each other in different networks. In some networks the interactions create highly clustered networks, resulting in a lower average path length. In other networks the interactions create lower clustered networks resulting, in a lower average path length.
The effect of size on random networks.
Another way to analyse the randomness of our networks, is by checking the effect of their size on the average clustering coefficient. Barabasi (2016) states that in a random network, the average clustering coefficient is independent of the average degree but depends on the network size. In contrast to random networks, the average clustering coefficient of real networks decrease by the average degree and does not depend on the network size (Barabasi, 2016). By plotting a multivariate visualization that contains these three variables, we might be able to find a trend in our dataset that concerns the previous statement.
bubble = px.scatter(descriptive_stats,
x = 'average_degree',
y = 'avg_cc', size = 'number_of_nodes',
labels = list(descriptive_stats.index),
hover_name = descriptive_stats.index,
color = descriptive_stats.index,
log_x = True, size_max = 60)
bubble.update_layout( title = 'Average clustering coefficient per average degree (Figure 4)',
xaxis = dict( title = 'Average degree',
gridcolor = 'white',
type = 'log',
gridwidth = 2, ),
yaxis = dict( title = 'Average clustering coefficient',
gridcolor = 'white',
gridwidth = 2, ),
paper_bgcolor = 'rgb(243, 243, 243)',
plot_bgcolor = 'rgb(243, 243, 243)', )
bubble.show()
This bubble chart shows that the smaller networks generally have a higher average clustering coefficient. However, Nashville doesn't seem to follow the same trend as the other networks due to it's distance and size. Consequently, Nashville resembles a real network as it's average clustering coefficient doesn't seem to be affected by it's size. This can be better seen in the graph by removing Nashville from the graph, which can be done by clicking on Nashville in the index.
Meet new friends through shared contacts.
To go futher into the interactions that create networks, we look at triadic closure. Triadic closure describes the phenomenon that people tend to meet new friends through shared contacts. It is called triadic closure because in a network graph a triangle is closed when someone meets a friend of a friend. The clustering coefficient describes how interconnected the neighbors of a particular node are with each other. Or in other words the fraction of the nodes possible triangles that are actually closed.
$$ c_u = \frac{2 T(u)}{deg(u)(deg(u)-1)}, $$whereT(u) is the number of triangles through node u and
deg(u) is the degree of u.
The average clustering coefficient of a network is defined as the clustering coefficient over all the nodes. According to Barabasi (2016) real networks with a small world property tend to have a high average clustering coefficient.
#colors = px.colors.qualitative.Plotly[:len(networks)]
#"blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue","blue"
avg_cc = px.bar(descriptive_stats,
x=list(networks.keys()),
y='avg_cc',
color = list(networks.keys()),
color_discrete_sequence= ["#1302ab","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7"],
title='Nashville Network clearly has a significant higher clustering coefficient (Figure 5)')
avg_cc.update_layout(xaxis_title='Networks', yaxis_title='Average clustering coefficient', showlegend=False)
avg_cc.show()
The takeaway of this bar chart is that Nashville resembles a real social network due to it's significantly higher average clustering coefficient in comparison to the other networks.
Clustering tells us that many nodes have common neighbors. According to Barabasi (2016) in a real network, high degree nodes tend to have a smaller clustering coefficient than low degree nodes. The more neighbors you have, the harder it is for your neighbors to know the same neighbors as you. This relationship is absent in random networks. Since Nashville Meetup Network is the network with the highest average clustering coefficient, this network is chosen first to plot.
In order to minimize the loading time of the jupyter notebook, the computations are done in advance and written to a csv file, which then can be loaded in.
networks_with_degree_cc_file_path = {
'NashvilleMeetupNetwork' : 'results/NashvilleMeetupNetwork_degree_vs_cc.csv',
'DeezerHR' : 'results/DeezerHR_degree_vs_cc.csv',
'DeezerHU' : 'results/DeezerHU_degree_vs_cc.csv',
'DeezerRO' : 'results/DeezerRO_degree_vs_cc.csv',
'FacebookLargepage' : 'results/FacebookLargepage_degree_vs_cc.csv',
'FeatherDeezerSocial' : 'results/FeatherDeezerSocial_degree_vs_cc.csv',
'FeatherLastfmSocial' : 'results/FeatherLastfmSocial_degree_vs_cc.csv',
'GithubSocial' : 'results/GithubSocial_degree_vs_cc.csv',
'TwitchSocialNetworksDE' : 'results/TwitchSocialNetworksDE_degree_vs_cc.csv',
'TwitchSocialNetworksENGB' : 'results/TwitchSocialNetworksENGB_degree_vs_cc.csv',
'TwitchSocialNetworksES' : 'results/TwitchSocialNetworksES_degree_vs_cc.csv',
'TwitchSocialNetworksFR' : 'results/TwitchSocialNetworksFR_degree_vs_cc.csv',
'TwitchSocialNetworksPTBR' : 'results/TwitchSocialNetworksPTBR_degree_vs_cc.csv',
'TwitchSocialNetworksRU' : 'results/TwitchSocialNetworksRU_degree_vs_cc.csv',
'SocSignSlashdot090221' : 'results/SocSignSlashdot090221_degree_vs_cc.csv',
'SocSlashdot0902' : 'results/SocSlashdot0902_degree_vs_cc.csv',
}
def write_cc_and_degree_to_csv(graph, file_path):
"""Computes cc and degree for each node and writes it to csv file."""
degree_dict = dict(graph.degree())
cc_dict = dict(nx.clustering(graph))
data = {'degree': degree_dict, 'cc' : cc_dict}
degree_cc = pd.DataFrame(data)
degree_cc.to_csv(file_path)
return
def calculate_correlation(file_path):
"""Return correlation between cc and degree."""
degree_cc = pd.read_csv(file_path, index_col=0)
return degree_cc.corr()['cc'].loc['degree']
## for loop to calculate cc and degree for all networks
# for name, network in networks.items():
# write_cc_and_degree_to_csv(network, networks_with_degree_cc_file_path[name])
Plot Nashville Meetup network:
NashVilleMeetupNetwork_degree_cc_ = pd.read_csv('results/NashvilleMeetupNetwork_degree_vs_cc.csv', index_col=0)
fig = px.scatter(NashVilleMeetupNetwork_degree_cc_, x="degree", y="cc", title="Clustering coefficient VS degree for Nashville Meetup Network (Figure 6)")
fig.show()
NashVilleMeetupNetwork_degree_cc_.corr()
| degree | cc | |
|---|---|---|
| degree | 1.000000 | -0.645953 |
| cc | -0.645953 | 1.000000 |
Indeed, Nashville Meetup Network looks like a real network. A pearson correlation of -0.65 further substantiates this statement. Let's now look at the complete opposite, Github Social.
GithubSocial_degree_cc_ = pd.read_csv('results/GithubSocial_degree_vs_cc.csv', index_col=0)
fig = px.scatter(GithubSocial_degree_cc_,
x="degree",
y="cc",
title="Clustering coefficient VS degree for Github Social (Figure 7)",
)
fig.show()
GithubSocial_degree_cc_.corr()
| degree | cc | |
|---|---|---|
| degree | 1.000000 | -0.046342 |
| cc | -0.046342 | 1.000000 |
Github Social on the other hand, does not have this real network property with a pearson correlation close to zero. It is unnecessary to plot this for every single network, to show the absence of this relationship. Therefore we construct a dataframe with all the pearson correlations.
cc_degree_corr = pd.DataFrame(index=networks.keys(), columns=['correlation'])
for name, file_path in networks_with_degree_cc_file_path.items():
cc_degree_corr['correlation'].loc[name] = calculate_correlation(file_path)
cc_degree_corr
| correlation | |
|---|---|
| NashvilleMeetupNetwork | -0.645953 |
| DeezerHR | -0.029309 |
| DeezerHU | -0.059689 |
| DeezerRO | 0.001017 |
| FacebookLargepage | -0.017038 |
| FeatherDeezerSocial | -0.044112 |
| FeatherLastfmSocial | 0.042086 |
| GithubSocial | -0.046342 |
| TwitchSocialNetworksDE | -0.123471 |
| TwitchSocialNetworksENGB | -0.041845 |
| TwitchSocialNetworksES | -0.133733 |
| TwitchSocialNetworksFR | -0.161227 |
| TwitchSocialNetworksPTBR | -0.210791 |
| TwitchSocialNetworksRU | -0.063507 |
| SocSignSlashdot090221 | -0.027151 |
| SocSlashdot0902 | -0.027694 |
This tendency to close triangles creates real networks with low distances, hubs, and high clustering. To better show how these interactions work, we look at the Barabasi-Albert model. This model was created to give insight into attachment which plays an important rol in real networks (Barabasi, 2016). In this model a network is created by adding new nodes and connecting them to a defined number of nodes that already exist in the network. This way of building networks corresponds more to triadic closure than networks created by the random network model. This dynamic also creates networks with hubs, which play a crucial rol in keeping the distance in networks small. As a concequence, the degree distribution looks more like one from a real network.
barabasi = nx.barabasi_albert_graph(11372, 100)
degree_sequence = [d for n, d in barabasi.degree()]
degree_count = nx.degree_histogram(barabasi)
degree_distribution = [count / sum(degree_count) for count in degree_count]
fig = go.Figure(data=go.Histogram(x=degree_sequence, y=degree_distribution))
fig.update_layout(
title="Degree Distribution of Barabasi-Albert network (Figure 8)",
xaxis_title="Degree",
yaxis_title="Number of nodes with degree",
bargap = 0.01,
)
fig.show()
In this data story, we explored the extent to which online social interactions are similar to real social interactions and the extent to which they differ. This was done by comparing different properties of the networks that emerge out of these different social interactions.
Through the analysis of degree distribution, we observed that the networks derived from social interactions differed from the binomial degree distribution of random networks, indicating that they are more likely real networks. Additionally, we calculated the average shortest path length and diameter to assess the small world property of the networks. While we couldn't conclusively determine if the networks were real or random based on this property alone, we noted that some networks exhibited shorter average path lengths, suggesting a higher degree of connectivity.
Furthermore, we explored the concept of triadic closure, where individuals tend to meet new friends through shared contacts. By examining the clustering coefficient, we found that real networks with a small world property tend to have a higher average clustering coefficient. This highlighted the interconnectedness of neighbors within the networks and supported the notion that social interactions in real networks often involve common acquaintances.
Based on the analysis, it can be concluded that Nashville Meetup Network is the only network that satisfies all the properties of a real network. The other networks in the analysis lacked high clustering coefficients and did not exhibit the same characteristics as real networks. The Nashville Meetup Network is the only network where the edges represent real world interactions and not interactions in a digital environment. This confirms our first perspective, which was that online social networks based on real life social interactions were more likely to resemble real networks. This perspective is confirmed because Nashville, our network which was based on real life social interaction, looked way more like a real network than our other networks that lacked those real life social interactions.
However, our data doesn't confirm our second perspective, which was that online social networks without real life social interaction were more likely to resemble random networks. This is because in some cases, our networks without real life social interaction did not resemble random networks.
These findings emphasize that humans are social creatures who thrive on real-world connections and shared experiences. In the digital realm, the dynamics of meeting new friends through shared contacts are not as prevalent. Online social networks are often characterized by a broader and more diverse range of connections, with less emphasis on close-knit communities.
We received one point of feedback from our classmates who reviewed our draft, which was that we unnecessarily showed the index in Figure 3. We listened to the feedback and decided to remove the index, to leave more space for the graph.
Our TA gave us multiple points of feedback to work with.
I took on most of the programming work, wrote different pieces of text. I also played an important role in searching many relevant concepts and ideas to incorporate into the project.
I made some of the graphs, helped with the perspectives, wrote text for the introduction, wrote text for the conclusion, helped with the peer feedback and wrote the reflection. I also did multiple spelling checks while we were making this project.
I made a few graphs, helped with some text for the introduction, wrote initial text for the degree distribution and made other minor changes, such as a spelling check, and headlines for appendices.
I made the bubble chart visualization, helped with the fraction of nodes degree distribution, completed the dataset table, helped formulating the structure for perspectives and wrote about triadic closure and bigger network.
Barabasi, A. L. (2016). Network Science. Cambridge: Cambridge University Press. http://networksciencebook.com/
Menczer, F., Fortunato, S., & Davis, C. (2020). A First Course in Network Science. Cambridge: Cambridge University Press. doi:10.1017/9781108653947
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393(6684), 440–442. https://doi.org/10.1038/30918
https://github.com/raul-vh/groepsproject-invi
Nashville https://www.kaggle.com/datasets/stkbailey/nashville-meetup?select=rsvps.csv
Deezer https://www.kaggle.com/datasets/andreagarritano/deezer-social-networks
Social Graphs https://www.kaggle.com/datasets/wolfram77/graphs-social